Expressive TTS Training With Frame and Style Reconstruction Loss
Abstract
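We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system that improves speech styling at the utterance level. One of the key challenges in prosody modeling is the lack of reference, which makes explicit modeling difficult. The proposed technique does not require prosody annotations from the training data. It does not attempt to model prosody explicitly either, but rather encodes the association between the input text and its prosody styles using a Tacotron-based TTS framework. This study marks a departure from the style token paradigm, where prosody is explicitly modeled by a bank of prosody embeddings. It adopts a combination of two objective functions: 1) a frame-level reconstruction loss, calculated between the synthesized and target spectral features; 2) an utterance-level style reconstruction loss, calculated between the deep style features of the synthesized and target speech. The style reconstruction loss is formulated as a perceptual loss to ensure that utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms a state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness.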
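The combination of the two objective functions described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact distance measures, the loss weighting, or the style feature extractor, so mean squared error, the `style_weight` parameter, and the `toy_style_features` function (a fixed tanh projection of time-averaged frames, standing in for a pretrained deep style descriptor) are all assumptions made for illustration. The sketch also assumes that synthesized and target features have the same number of frames.

```python
import math


def frame_loss(pred, target):
    """Frame-level reconstruction loss: MSE over all frames and spectral bins.

    MSE is an assumption; the abstract only says the loss is calculated
    between synthesized and target spectral features.
    """
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def toy_style_features(frames, weights):
    """Hypothetical stand-in for a pretrained deep style feature extractor.

    Averages the frames over time (one vector per utterance), then applies
    a fixed linear projection followed by tanh.
    """
    n_bins = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(n_bins)]
    return [math.tanh(sum(w * m for w, m in zip(row, mean))) for row in weights]


def style_loss(pred, target, weights):
    """Utterance-level style reconstruction loss between deep style features."""
    fp = toy_style_features(pred, weights)
    ft = toy_style_features(target, weights)
    return sum((a - b) ** 2 for a, b in zip(fp, ft)) / len(fp)


def total_loss(pred, target, weights, style_weight=1.0):
    """Sum of both losses; style_weight is an assumed hyperparameter."""
    return frame_loss(pred, target) + style_weight * style_loss(pred, target, weights)


if __name__ == "__main__":
    target = [[0.0, 1.0], [1.0, 0.0]]   # two frames, two spectral bins
    pred = [[0.0, 0.0], [0.0, 0.0]]     # a poor synthesis
    weights = [[1.0, 0.0], [0.0, 1.0]]  # fixed (frozen) projection
    print(total_loss(pred, target, weights))
```

In this sketch the extractor weights are held fixed, so the style term acts only as a training signal for the synthesized features, which is the usual arrangement for a perceptual loss.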
Similar resources
Semantics and Discourse Processing for Expressive TTS
In this paper we present ongoing work to produce an expressive TTS reader that can be used both in text and dialogue applications. The system has been previously used to read (English) poetry and it has now been extended to apply to short stories. The text is fully analyzed both at phonetic and phonological level, and at syntactic and semantic level. The core of the system is the Prosodic Manag...
Designing speech database with prosodic variety for expressive TTS system
For the purpose of building a speech synthesis system that can generate high-quality speech with a wide range of prosody and realize fine prosody control, we propose a new speech database construction method. As a speech synthesis method, we select a hybrid system which consists of two parts: a speech unit selection part and a prosody modification part by STRAIGHT (vocoder type high quality analysis-synthesis...
Adding speaking style to a TTS system
This paper aims to enhance the performance of a TTS system by generating various speaking styles. First we describe three speaking styles (Radio News, Political Address and Conversation) and compare the prosodic features found in these authentic styles with the prosody in “neutral” speech uttered by the eLite TTS system ([1]). Differences concern about 20 prosodic characteristics (F0 span, spee...
Emotional Style Conversion in the TTS System with Cepstral Description
This contribution describes experiments with emotional style conversion performed on the utterances produced by the Czech and Slovak text-to-speech (TTS) system with cepstral description and basic prosody generated by rules. Emotional style conversion was realized as post-processing of the TTS output speech signal, and as a real-time implementation into the system. Emotional style prototypes rep...
Expressive language style among adolescents and adults with Williams syndrome
Language samples elicited through a picture description task were recorded from 38 adolescents and adults with Williams syndrome (WS) and one control group matched on age, and another matched on age, IQ, and vocabulary knowledge. The samples were coded for use of various types of inferences, dramatic devices, and verbal fillers; acoustic analyses of prosodic features were carried out, and an in...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2021
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2021.3076369